                      REGULAR EXPRESSION DLL

            A Dynamic Link Library for Microsoft Windows

                                by

                     Windfall Software Systems
                         40 Windfall Lane
                        Marlboro, NJ 07746
                     CompuServe ID: 71330,3614

             Copyright Windfall Software Systems, 1989
                        All Rights Reserved


A. Software License
B. Concepts and Facilities
   1. Package Contents
   2. Regular Expressions
   3. Sample Regular Expressions
C. Functions Overview
D. Applications
E. Demonstration Program
F. Function Reference
   1. RxMatch - Match a Regular Expression
   2. RxExtract - Extract a Matching Group
   3. RxReplace - Replace Placeholders
   4. RxMsgText - Build Error Message
G. Registration Form


A. SOFTWARE LICENSE

You are granted a limited licence to use the Regular Expression DLL
on a private, non-commercial basis and to make copies of this package
and distribute them to other users, under the following conditions:

  - This package must be copied and/or distributed in unmodified
    form, complete with the file containing this licence information.

  - No fees or other compensation may be requested or accepted by any
    licensee, except that clubs and user groups may charge a nominal
    fee not to exceed $10 for expenses and handling.

  - No part of the software contained in this package can be
    distributed with any other product or service.

If you want to use this software in a different way, you can use this
package for evaluation only. If it fits your requirements, you can
obtain a typical nonexclusive licence to use the software and related
documentation, on a single computer at a time, and distribute
derivative works. To obtain this licence, mail the Registration Form
(last page) and a registration fee of $10 to the address shown on the
form. The licence will be mailed to you. You can also use this form
to order the complete source code of this package and its internal
documentation.

You, the Customer, assume all responsibility for the selection of
this package as appropriate to achieve the results intended by the
Customer.

The software of the Regular Expression DLL is provided "as is"
without warranty of any kind, either expressed or implied, including,
but not limited to the implied warranties of merchantability and
fitness for a particular purpose.

In no event will Windfall Software Systems be liable for any damages
arising out of the Customer use of the software, including, loss of
data or profits, loss of use or other economic loss, or indirect,
incidental, consequential or special damages of any kind, even if
Windfall Software Systems has been advised of the possibility of the
same.

In no event shall Windfall Software Systems' liability for any
damages exceed the price paid for the license to use the software,
regardless of the form of the claim.


B. CONCEPTS AND FACILITIES

This package is a dynamic link library (DLL) designed to perform
regular expression searches and other related operations in a Windows
application.

A regular expression is a string that defines a pattern of text by
using certain special characters. Those special characters let you
specify optional choices, repetitions and character classes in such a
way that a given regular expression matches not one string but all
strings having some selected properties.

Regular expressions are often supported by text editors. Most likely,
the one you use provides a generalized search command that uses some
form of regular expressions to define text patterns.

Although programmers make use of regular expressions while editing
programs and using other programming tools, it is not common practice
to use regular expression routines in the applications that result
from this work. Yet, as we will try to demonstrate in a few examples,
regular expression routines can simplify many typical operations.

1. Package Contents

The following files make up the evaluation version of the package:

  - WSSRX1.TXT   Documentation (this file).
  - WSSRX1.EXE   Dynamic link library.
  - WSSRX1.LIB   Import library for WSSRX1.EXE.
  - WSSRX.H      Interface definitions/declarations.
  - RXTEST.EXE   Demonstration program.

The header file (WSSRX.H) and the import library (WSSRX1.LIB) are
necessary to compile and link programs that use the dynamic link
library. The header file should be copied to the directory pointed to
by the INCLUDE environment variable. The import library should be
copied to the directory defined by the LIB environment variable. This
is the simplest setup. If you choose different directories, you will
have to adjust the #include directives and the linker parameters.

The library itself (WSSRX1.EXE) is the only component needed by a
compiled program.

2. Regular Expressions

Regular expressions describe more or less complex text patterns. A
simple pattern is merely a character, such as x, or a string of
characters taken literally, such as ABC. Regular expressions like
that represent specific strings, character for character. More
complex patterns use special characters that represent not individual
strings but specific context.

The following items define elementary regular expressions matching a
single character:

  - An ordinary character matches itself. An ordinary character is a
    character other than one of the following special characters:

        \ ^ $ . [ | { } * + ?

  - A backslash (\) followed by a character matches that character,
    even if the character alone is special. Note that the C language
    uses this character in a similar fashion, so to specify a
    backslash in a C string constant, you have to use it twice, e.g.
    "\\{".

  - A caret (^) matches itself except when it appears at the
    beginning of the entire regular expression. The meaning of this
    character at the beginning is defined later.

  - A dollar sign ($) matches itself except when it appears at the
    end of the entire regular expression. The meaning of this
    character at the end is defined later.

  - A period (.) matches any character.

  - A non-empty string enclosed in square brackets ([]) is called a
    character class. It matches any one character in that string.
    If, however, the first character of the string is a caret (^),
    the character class matches any character except the characters
    in the string. The caret represents itself when it appears
    somewhere else in the string. The minus sign (-) may be used to
    represent a range of consecutive characters. For example, a-z
    represents any lower case letter. The minus represents itself
    when it appears first (possibly after an initial caret) or last
    in the string. The right square bracket stands for itself if it
    is the first character within the string (after an initial caret,
    if any). All other characters defined above as special represent
    themselves in a character class (e.g. [\.] means "backslash or
    period").

The following rules can be used recursively to construct more
complicated regular expressions:

  - An elementary regular expression is a regular expression matching
    a single character as described earlier.

  - A concatenation of regular expressions is a regular expression
    that matches the concatenation of strings matched by each
    component of the concatenation.

  - An alternative, i.e. two regular expressions separated by a |, is
    a regular expression that matches a string matched by at least
    one of the components. If both components match, preference is
    given to the left one.

  - A group, i.e. a regular expression enclosed in braces ({}), is a
    regular expression that matches the same string as the enclosed
    expression.

  - An iteration, i.e. an elementary regular expression or a group
    followed by an asterisk (*), is a regular expression that matches
    zero or more occurrences of the string matched by the expression
    preceding the asterisk. If there is any choice, the longest
    leftmost string that facilitates a match is chosen.

  - A non-empty iteration, i.e. an elementary regular expression or a
    group followed by a plus (+), is a regular expression that
    matches one or more occurrences of the string matched by the
    expression preceding the plus.
    If there is any choice, the longest leftmost string that
    facilitates a match is chosen.

  - An option, i.e. an elementary regular expression or a group
    followed by a question mark (?), is a regular expression that
    matches zero or one occurrence of the string matched by the
    expression preceding the question mark. If there is a choice, the
    match with one occurrence is chosen.

  - A caret (^) at the beginning of a regular expression constrains
    the match to an initial segment of a string.

  - A dollar sign ($) at the end of a regular expression constrains
    the match to a final segment of a string.

The above rules introduce some special characters that behave like
operators and establish the precedence criteria for them. For
example, because iterations bind single characters or groups, a|b+
matches a single a or one or more b's. To match one or more a's or
one or more b's, we would have to use {a|b}+.

Note that the caret and the dollar sign are somewhat different from
the other operators. They do not have any special meaning unless they
stand at the beginning or the end of the expression.

3. Sample Regular Expressions

Each of the examples below consists of three parts:

    regular expression
        one or more strings to be matched against that expression
    explanation

The matching part of each sample string is shown to the right of the
string.

    [a-zA-Z0-9]
        (((ABC)))               A

    Matches a single letter or digit.

    [a-zA-Z0-9]+
        (((ABC)))               ABC
        201-777-1212            201

    Matches a "word", i.e. a string of letters and/or digits
    delimited by something else.

    [a-zA-Z][a-zA-Z0-9]*
        (32, -x2)               x2

    Matches an ALGOL-like identifier, i.e. a string of letters and/or
    digits starting with a letter.

    Y{AB|CD}Z
        XXYCDZXX                YCDZ

    Matches YABZ or YCDZ.

    [a-zA-Z]*{ie|ei}[a-zA-Z]*
        selected properteis     properteis

    Matches words, i.e. strings of letters, that contain ie or ei
    (and are often misspelled).

    [0-9]+{\.[0-9]+}?{E{+|-}?[0-9]+}?
        - 12.79;                12.79
        a = 3.14E-2,            3.14E-2

    Matches a decimal number with an optional fraction and an
    optional exponent.
    As you will see later, the groups surrounded by braces not only
    establish scopes for the ? operators but also can be used to
    extract parts of the matched number.

    [a-z]*[,.?!]?
        xxx, yyyy.              xxx,
        1234567890              match on a null string

    Matches a possibly empty string of letters followed by an
    optional delimiter? Yes, but most likely this is not what you
    want. This pattern will show a match with any string, because all
    of its parts are optional. It may match something non-trivial if
    the test string starts with the right combination of characters.
    Otherwise, it will match the empty string that exists at the
    beginning of any string. All strings start with "a possibly empty
    string of letters followed by an optional delimiter".

    Be careful with the ? and * operators. In most cases, they have
    to be used within some non-empty context to yield good results.
    Sometimes you can use them in the way presented here, but you
    should check that the matching substring is non-empty. In both
    cases shown above, the library indicates a match. To find out
    whether the match is non-trivial, you have to check that the size
    of the matching string is nonzero.


C. FUNCTIONS OVERVIEW

The primary function in this library is the RxMatch function. This
function searches a given string looking for a substring that matches
a given regular expression. This operation consists of two steps.
First, the function parses the regular expression and converts it
into an internal form which facilitates faster matching. This step
may fail if the function discovers a syntax error in the regular
expression. If the parsing is successful, the next step tries to
match the internal form of the expression with the given string. If
there is a matching substring, the function returns a non-zero value.
Otherwise, it returns zero.

A call to the RxMatch function always supplies a regular expression
and a string to match. However, to avoid repeated parsing of the same
regular expression, the library provides a caching facility.
The cache holds a number of recently used regular expressions,
together with their translations. If the expression supplied in the
current call is the same as one in the cache, the parsing is
bypassed. The cache capacity depends on the size of the regular
expressions held in it and their complexity. For typical
applications, you can assume that the few most recently used
expressions will be found in the cache.

The RxMatch function can respond with more information than just a
boolean return value. Its first parameter points to an area (provided
by the calling program) where the function places the additional
information. This area, called the feedback array, consists of 1 to
16 elements of the following type:

    typedef struct {
        int pos;
        int size;
    } RX;

When an RxMatch call is unsuccessful because no match exists, the
feedback array is cleared to zeros. However, if the call failed due
to some formal errors in the regular expression, a nonzero error code
is inserted into the first word of the array (rx[0].pos). The error
code can be converted into a text message by the RxMsgText function.

A successful call to RxMatch places the relative position and the
size of the matched substring in the pos and size fields of the first
feedback array element. The remaining elements receive the positions
and sizes of the substrings that match groups in the regular
expression. For example, when we match:

    {[0-9]*}A+{[0-9]*}

with

    XXX5678AAA876...

the feedback array will have the following elements:

    3; 10    position and size of the match (5678AAA876)
    3; 4     first group (5678)
    10; 3    second group (876)

The feedback array, filled in by a call to the RxMatch function, can
be used by subsequent calls to the RxExtract and RxReplace functions.
The RxExtract function copies one of the substrings identified by the
match into another field. It can extract the whole match or a group
match. The RxReplace function operates a little bit like the sprintf
function.
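To make the feedback array concrete, the following is a minimal C
sketch that reads the pos and size fields directly, using the values
listed above for the XXX5678AAA876... example. It makes no calls to
the DLL: rx_slice, sample_text and sample_rx are hypothetical names
introduced here for illustration, and the truncating copy is only an
assumption about what a caller might want, not the library's code.

```c
#include <stddef.h>
#include <string.h>

/* Feedback array element, as declared in the documentation. */
typedef struct {
    int pos;   /* offset of the substring within the searched text */
    int size;  /* length of the substring in characters */
} RX;

/*
 * Hypothetical helper (not part of the DLL): copy the substring
 * described by rx[n] out of text into dst, which holds dst_size bytes.
 * At most dst_size-1 characters are copied and the result is always
 * null-terminated. Returns dst, or NULL if dst cannot hold anything.
 */
char *rx_slice(const RX *rx, int n, const char *text,
               char *dst, size_t dst_size)
{
    size_t len;

    if (dst == NULL || dst_size == 0)
        return NULL;
    len = (size_t)rx[n].size;
    if (len > dst_size - 1)
        len = dst_size - 1;              /* truncate to fit the buffer */
    memcpy(dst, text + rx[n].pos, len);
    dst[len] = '\0';
    return dst;
}

/* The values the text lists for {[0-9]*}A+{[0-9]*} and this string. */
const char sample_text[] = "XXX5678AAA876...";
const RX sample_rx[3] = {
    { 3, 10 },   /* whole match:  5678AAA876 */
    { 3, 4 },    /* first group:  5678 */
    { 10, 3 }    /* second group: 876 */
};
```

Calling rx_slice(sample_rx, 1, sample_text, buf, sizeof buf) with a
16-byte buf leaves 5678 in buf. In a program linked with the library,
RxExtract does this kind of copy for one group at a time, while
RxReplace works on a whole template string.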
It replaces placeholders in a given string with the matching
substring and/or the group matches.

Naming conventions

In variables and parameters, we use the prefix lsz to denote a long
pointer to a zero-terminated character string and the prefix rx for a
feedback array (RX []). Otherwise, we follow the conventions from the
Windows SDK.


D. APPLICATIONS

1. Elimination of trailing white-space

Truncate a given null-terminated character string lszText,
eliminating all trailing white-space characters.

    RX rx[1];

    if (RxMatch(rx,1,"[ \t\n\v\f\r]+$",lszText))
        lszText[rx[0].pos] = '\0';

Note that there is a blank character at the beginning of the
character class. The escape sequences define respectively: tab, new
line, vertical tab, form feed and carriage return. They have nothing
to do with the escape sequences used by regular expressions. During
compilation, the compiler changes them into the corresponding ASCII
codes. When RxMatch is called, the character class will contain six
characters: blank, tab, new line, etc. We use the $ at the end to
restrict the eventual match to the end of the tested string.

2. Reduction of white-space

In a given null-terminated character string lszText, replace each
sequence of white-space characters with a single space.

    LPSTR lpx;
    RX    rx[1];

    lpx = lszText;
    while (RxMatch(rx,1,"[ \t\n\v\f\r][ \t\n\v\f\r]+",lpx)) {
        lpx += rx[0].pos;                      /* start of the match  */
        *lpx++ = ' ';                          /* keep a single space */
        lstrcpy(lpx,lpx + rx[0].size - 1);     /* drop rest of match  */
    }

In this example the regular expression matches two or more
white-space characters. The feedback array is used to locate the
match in the string and to shrink the string at that point. One
character of the match is kept (overwritten with a space), so the
tail of the string is copied from size - 1 characters further on.

3. Parsing of a filename

Assuming that a string lpFile contains a DOS file name, divide it
into components: drive specification (X:), path (A\B\C\) and name
proper. All of the components are optional.

    RX rx[4];

    if (RxMatch(rx,4,"{.:}?{.*}{[^\\\\]+} *",lpFile)) {
        RxExtract(rx,1,lpFile,szDrive,sizeof(szDrive));
        RxExtract(rx,2,lpFile,szPath,sizeof(szPath));
        RxExtract(rx,3,lpFile,szName,sizeof(szName));
    }

The first group matches an optional drive designation. The last group
matches whatever follows the last backslash or, if there is no
backslash, anything that follows the drive designation. The middle
group matches everything in between (i.e. strings like xxx\yyy\). The
iteration at the end removes trailing spaces from the third group.
Note that this is not a test that lpFile contains a valid file name.
The RxExtract calls extract the group matches and place them in
szDrive, szPath and szName.

4. Validation of input

Check if a given input field szIn contains a valid hexadecimal number
and extract its components for further processing.

    char szHex[10];
    RX   rx[3];
    int  value;

    if (RxMatch(rx,3,"^ *{[+-]?} *{[0-9a-fA-F]*} *$",szIn)) {
        RxExtract(rx,2,szIn,szHex,sizeof(szHex));
        /* --- convert szHex to value --- */
        if (rx[1].size && szIn[rx[1].pos] == '-')
            value = -value;
    }
    else {
        /* --- error - invalid input --- */
    }

Here we anchored the match with ^ and $ so that the whole string is
checked. Any extraneous characters (e.g. "-5b x" or "+ -25") will
cause a mismatch. Spaces are accepted on both ends and after an
optional sign. The feedback array is used to check if the number is
negative.

5. Format change

Check if a given input field szIn contains two decimal numbers
separated with a comma and/or spaces. If it does, transfer them into
another field with the following format:

    Length=x, Width=y

where x and y are the numbers extracted from szIn.

    char szAux[50];                 /* The output field */
    RX   rx[4];

    if (RxMatch(rx,4,"^ *{[0-9]+} *{,| } *{[0-9]+} *$",szIn)) {
        strcpy(szAux,"Length=%1, Width=%3");
        RxReplace(rx,4,szIn,szAux,sizeof(szAux));
    }
    else {
        /* --- error - invalid input --- */
    }

In this example we use two groups to access the data. An additional
group (i.e.
{,| }) is used to override the usual interpretation of the regular
expression operators. The alternative "comma or space" has to be
enclosed in braces. Without them the scope of the | operator would be
too large: " *{[0-9]+} *," OR " *{[0-9]+} *". The RxReplace function
is used to replace the two placeholders (%1, %3) with the substrings
matching the first and the third group.

6. Parsing of a telephone number

Search a given string szText for a substring resembling a telephone
number.

    static char szRex[] = "{(?[2-9][0-9][0-9])?}? *-? *"
                          "{[2-9][0-9][0-9]} *-? *"
                          "{[0-9][0-9][0-9][0-9]}";
    RX rx[4];

    if (RxMatch(rx,4,szRex,szText)) {
        RxExtract(rx,1,szText,lszArea,6);
        RxExtract(rx,2,szText,lszExch,4);
        RxExtract(rx,3,szText,lszNmbr,5);
    }

When the match is successful, we transfer the area code, the exchange
and the last four digits to the fields pointed to by lszArea, lszExch
and lszNmbr.


E. DEMONSTRATION PROGRAM

This simple Windows application (RXTEST.EXE) can be used to learn how
to compose regular expressions and what to expect from the functions
provided by the library.

When you first start RXTEST, it displays a dialog box with a number
of empty text fields and one button (Execute). You can enter text
into the top three fields. The remaining fields are used by the
program to display results. The description of all the fields
follows.

    Pattern
        A regular expression.

    Search area
        Any text to be searched for a substring matching the regular
        expression entered in the Pattern field.

    Replacement area
        Text to be passed to the RxReplace function.

    Return code
        The return value from the RxMatch function followed by an
        error message.

    Replacement result
        Text produced by the RxReplace function from the text in the
        replacement area field.

    0: 1: 2: 3: 4: 5: 6: 7:
        Texts extracted from the search area field by the RxExtract
        calls referencing a group with the given number. The first
        group (number 0) is defined as the full match.
        The subsequent groups are defined by the braces ({}) in the
        regular expression.

Use the TAB/BACKTAB keys to move between the three input fields.
Click the EXECUTE button or press the ENTER key to perform the match
and related operations. Using different combinations of the input
values, you can play with all the functions in the library.


F. FUNCTION REFERENCE

This chapter contains a list of functions from the Regular Expression
DLL. The documentation for each function is organized in a way
similar to that used in the Windows SDK.

All function prototypes and other related declarations are contained
in the header file wssrx.h. This header should be included (after
windows.h) in all program files that refer to the functions in the
DLL. The actual format of the include directive depends on the
placement of the file. For example:

    #include "wssrx.h"         current directory,
    #include <wssrx.h>         a directory defined by the INCLUDE
                               environment variable,
    #include <SUB\wssrx.h>     a subdirectory SUB of the above
                               directory.

Usually, you make direct calls to the library functions and use the
import library (wssrx1.lib) when linking the program. In this case
Windows will perform dynamic linking when the program is first loaded
into memory. You can defer dynamic linking using the following
technique:

    HANDLE  hRx;               /* Library handle         */
    FARPROC lpfnRxMatch;       /* To the RxMatch function */

    hRx = LoadLibrary("WSSRX1.EXE");
    lpfnRxMatch = GetProcAddress(hRx,OV_RXMATCH);
    /* - - - */
    if (lpfnRxMatch(rx,3,"[0-9]+",szInput)) {
        /* - - - */
    }
    /* - - - */
    FreeLibrary(hRx);

The header file wssrx.h contains definitions of four symbols
(OV_xxxxx) that can be used with the GetProcAddress call to retrieve
the addresses of the respective Rx functions.

1. RxMatch - Match a Regular Expression

    BOOL RxMatch(rx, nLim, lszRex, lszTxt)

This function searches the lszTxt string for a substring that matches
the regular expression given by the lszRex string.
The result of the search is reflected by the return value and by the
values set in the feedback array defined by the rx and nLim
parameters.

    Parameter   Type/Description

    rx          RX FAR []  Specifies an area to be used as the
                feedback array. If rx is NULL, no feedback
                information is returned.

    nLim        int  Specifies the number of elements in the feedback
                array rx. If nLim is zero or negative, no feedback
                information is returned.

    lszRex      LPSTR  Points to a null-terminated string that
                specifies the regular expression.

    lszTxt      LPSTR  Points to a null-terminated string that is to
                be matched with the regular expression.

a) Return Value

The return value specifies the outcome of the function. It is
non-zero if the match was successful. Otherwise, it is zero.

b) Comments

The return value of zero means that either there was no match with
the regular expression or the parameters received by the function
were invalid. These two cases can be distinguished only when the
feedback array is non-empty (i.e. when rx is not NULL and nLim is
greater than 0). When the parameters received by the function are in
error, the function places a non-zero error code in the first word of
the feedback array. When everything is valid and only the match is
unsuccessful, the function clears that word.

All possible values of the error codes are defined in the wssrx.h
header file as ERR_xxxx manifest constants. The RxMsgText function
can be used to convert an error code to a text message.

The feedback array does not have to be initialized in any way before
the call to RxMatch. Only the first 16 entries of the array are
effectively used by the functions in this library.

2. RxExtract - Extract a Matching Group

    LPSTR RxExtract(rx, nN, lszTxt, lszDst, nSize)

This function assumes that the rx and lszTxt parameters have been
used as arguments in a successful RxMatch call. The function copies a
substring of lszTxt (adding the terminating null character) to the
destination area specified by lszDst.
The substring extracted by the function is defined as follows:

  - If nN is zero, it is the substring matching the whole regular
    expression in the RxMatch call.

  - If nN is greater than zero, it is the substring matching the
    nN-th group of the regular expression. If no group in the regular
    expression corresponds to nN, the substring is empty.

If the extracted substring (including the terminating null) is longer
than the destination length (nSize), it is truncated to fit in the
destination area.

    Parameter   Type/Description

    rx          RX FAR []  Specifies an area used as a feedback array
                in a successful RxMatch call.

    nN          int  Specifies a group number in the regular
                expression used by the RxMatch call. This value
                should not exceed the value of the nLim argument
                passed to RxMatch.

    lszTxt      LPSTR  Points to a null-terminated string that has
                been matched with the regular expression using the
                RxMatch call.

    lszDst      LPSTR  Points to the buffer that receives the
                extracted substring.

    nSize       int  Specifies the number of characters (including
                the last null character) that can be copied to the
                buffer.

a) Return Value

The return value points to the extracted substring (same as lszDst).

b) Comments

The extracted substring is always terminated with the null character.
If the receiving buffer is too short to accommodate the entire
extracted substring, the function copies the nSize-1 leftmost
characters and appends the null character.

3. RxReplace - Replace Placeholders

    LPSTR RxReplace(rx, nLim, lszTxt, lszDst, nSize)

This function assumes that the rx, nLim and lszTxt parameters have
been used as arguments in a successful RxMatch call. The function
modifies the string defined by lszDst, replacing placeholders
embedded in it with substrings extracted from the string defined by
lszTxt. A placeholder is a single hexadecimal digit (0,1,...,e,f)
preceded by the percent character (e.g. %2 or %a).
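As a rough model of this substitution, the sketch below expands %n
placeholders, where n selects element n of a feedback array. It is
not the library's implementation: hex_value and expand_placeholders
are hypothetical names, the sketch writes into a separate output
buffer instead of modifying the template in place, it accepts only
the digits 0-9 and a-f, and it copies a % that is not followed by
such a digit unchanged. The last two points are assumptions; the
documentation does not say how the DLL treats those cases.

```c
#include <stddef.h>
#include <string.h>

/* Feedback array element, as declared in the documentation. */
typedef struct { int pos; int size; } RX;

/* Map one hexadecimal digit (0-9, a-f) to 0..15, or return -1. */
int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Hypothetical model of placeholder expansion (not the DLL's code).
 * Copies tmpl to out, replacing each %n with the substring of text
 * described by rx[n]; n at or beyond n_lim expands to nothing.
 * Returns 1 if everything fit in out_size bytes, 0 if the output
 * had to be truncated.
 */
int expand_placeholders(const RX *rx, int n_lim, const char *text,
                        const char *tmpl, char *out, size_t out_size)
{
    size_t used = 0;
    int complete = 1;

    if (out_size == 0)
        return 0;
    while (*tmpl != '\0') {
        const char *src = tmpl;      /* default: one literal character */
        size_t len = 1;
        int n = (tmpl[0] == '%') ? hex_value(tmpl[1]) : -1;

        if (n >= 0) {                /* placeholder: substitute group n */
            if (n < n_lim) {
                src = text + rx[n].pos;
                len = (size_t)rx[n].size;
            } else {
                len = 0;
            }
            tmpl += 2;
        } else {
            tmpl += 1;
        }
        if (len > out_size - 1 - used) {   /* keep room for the null */
            len = out_size - 1 - used;
            complete = 0;
        }
        memcpy(out + used, src, len);
        used += len;
    }
    out[used] = '\0';
    return complete;
}
```

With the feedback values from the XXX5678AAA876... example in the
Functions Overview (3;10, 3;4 and 10;3), the template %1-%2 expands
to 5678-876 under this model.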
The placeholder of the form %n is replaced by the substring of lszTxt
determined as follows:

  - If n is zero, it is the substring matching the whole regular
    expression in the RxMatch call.

  - If n is greater than zero, it is the substring matching the n-th
    group of the regular expression. If no group in the regular
    expression corresponds to n, the substring is empty.

If the modified string (including the terminating null) is longer
than the destination length (nSize), it is truncated at the end to
fit in the destination area.

    Parameter   Type/Description

    rx          RX FAR []  Specifies an area used as a feedback array
                in a successful RxMatch call.

    nLim        int  Specifies the size of the rx array used in the
                RxMatch call.

    lszTxt      LPSTR  Points to a null-terminated string that has
                been matched with the regular expression using the
                RxMatch call.

    lszDst      LPSTR  Points to the null-terminated string that is
                to be modified.

    nSize       int  Specifies the number of characters that can be
                used in the area defined by lszDst.

a) Return Value

The return value points to the last position (the terminating null
character) of the modified string (lszDst). The value is NULL if
during the operation some characters were lost due to lack of free
space in lszDst.

b) Comments

The RxReplace function uses the area defined by lszDst and nSize as
its only work area. It replaces the placeholders one by one, starting
from the left. After each replacement, the destination string
expands, shrinks or keeps its length, depending on what replaces the
placeholder. The destination area should be large enough to
accommodate any one of those intermediate results.

4. RxMsgText - Build Error Message

    int RxMsgText(rx, lszMsg, nSize)

This function assumes that the rx parameter has been used as an
argument in an unsuccessful RxMatch call. If the RxMatch call failed
because of errors, the function creates an error message in the
buffer defined by the lszMsg parameter.
If the RxMatch call was unsuccessful because there was no match with
the regular expression, the function puts an empty string in the
buffer.

If the message text string (including the terminating null) is longer
than the buffer length (nSize), it is truncated at the end to fit in
the buffer area.

    Parameter   Type/Description

    rx          RX FAR []  Specifies an area used as a feedback array
                in an unsuccessful RxMatch call.

    lszMsg      LPSTR  Points to the buffer that receives the error
                message text.

    nSize       int  Specifies the number of characters (including
                the last null character) that can be copied to the
                buffer.

a) Return Value

The return value specifies the length of the error message. It is
zero if there was no error condition detected.

b) Comments

The function can respond with one of the following messages:

    Rex or Txt parameter is NULL
    Rex too long
    {{{ }}} too deep
    Missing right brace
    Missing left brace
    Iteration (+*) on empty string
    Nested iteration
    Invalid range
    Missing right bracket
    Incomplete escape sequence
    Logic error


G. REGISTRATION FORM

    Name         _____________________________________
    Company      _____________________________________
    Title        _____________________________________
    Address      _____________________________________
    City, State  ________________________  Zip ________
    Phone        _____________________________________

    Registration fee
        # computers ____ x $10                           ______

    Documentation and the source code on a disk - add:
        # computers ____ x $15                           ______

    TOTAL ............................................... ______

    Diskette format for the source code (choose one)

        5.25" disk _____        3.5" disk _____

    Mail this form with your payment to:

        Windfall Software Systems
        40 Windfall Lane
        Marlboro, NJ 07746

----------------end-of-author's-documentation---------------

Software Library Information:

This disk copy provided as a service of

    The Public (Software) Library

We are not the authors of this program, nor are we associated with
the author in any way other than as a distributor of the program in
accordance with the author's terms of distribution.

Please direct shareware payments and specific questions about this
program to the author of the program, whose name appears elsewhere in
this documentation. If you have trouble getting in touch with the
author, we will do whatever we can to help you with your questions.
All programs have been tested and do run. To report problems, please
use the form that is in the file PROBLEM.DOC on many of our disks or
in other written format with screen printouts, if possible. The P(s)L
cannot debug programs over the telephone.

Disks in the P(s)L are updated monthly, so if you did not get this
disk directly from the P(s)L, you should be aware that the files in
this set may no longer be the current versions.

For a copy of the latest monthly software library newsletter and a
list of the 2,000+ disks in the library, call or write

    The Public (Software) Library
    P.O.Box 35705 - F
    Houston, TX 77235-5705
    (713) 665-7017